descriptive text
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
Kim, Minsu, Ma, Pingchuan, Chen, Honglie, Petridis, Stavros, Pantic, Maja
This paper explores multi-modal controllable text-to-speech synthesis (TTS), where a voice can be generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural-language text description. Specifically, we aim to mitigate three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many possibilities in face-to-voice mapping while still ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with the generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
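The consistency strategy described in the abstract above (sample several plausible voices, then prompt with one of them) can be caricatured in a few lines. Everything below, including the noise model and the embedding shapes, is invented for illustration and is not the paper's actual decoder:

```python
import random

def face_to_voice_candidates(face_embedding, n_samples=3, noise=0.1, seed=0):
    """Sampling-based decoding: the one-to-many face-to-voice mapping is
    mimicked by drawing several voice embeddings around a deterministic
    prediction (a toy stand-in for a real stochastic decoder)."""
    rng = random.Random(seed)
    return [
        [v + rng.gauss(0.0, noise) for v in face_embedding]
        for _ in range(n_samples)
    ]

def synthesize(text, voice_prompt):
    """Prompted synthesis: every utterance is conditioned on the same
    previously generated voice sample, keeping the speaker consistent."""
    return {"text": text, "voice": tuple(voice_prompt)}

face = [0.2, -0.5, 0.9]
candidates = face_to_voice_candidates(face)
chosen = candidates[0]  # fix one sampled voice...
utterances = [synthesize(t, chosen) for t in ["Hello.", "How are you?"]]
assert utterances[0]["voice"] == utterances[1]["voice"]  # ...shared by all outputs
```

The point of the sketch is only the two-stage shape: sampling gives diversity across speakers, prompting then pins one sample down for within-speaker consistency.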
A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks
Liang, Chia Xin, Tian, Pu, Yin, Caitlyn Heqi, Yua, Yao, An-Hou, Wei, Ming, Li, Wang, Tianyang, Bi, Ziqian, Liu, Ming
This survey and application guide to multimodal large language models (MLLMs) explores the rapidly developing field of MLLMs, examining their architectures, applications, and impact on AI and generative models. Starting with foundational concepts, we delve into how MLLMs integrate various data types, including text, images, video, and audio, to enable complex AI systems for cross-modal understanding and generation. The survey covers essential topics such as training methods, architectural components, and practical applications in fields ranging from visual storytelling to enhanced accessibility. Through detailed case studies and technical analysis, it examines prominent MLLM implementations while addressing key challenges in scalability, robustness, and cross-modal learning. Concluding with a discussion of ethical considerations, responsible AI development, and future directions, this resource provides both theoretical frameworks and practical insights. It offers a balanced perspective on the opportunities and challenges in developing and deploying MLLMs, and should be valuable for researchers, practitioners, and students interested in the intersection of natural language processing and computer vision.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > Indiana (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (5 more...)
- Workflow (1.00)
- Research Report > Promising Solution (1.00)
- Overview (1.00)
- (2 more...)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Services (1.00)
- (9 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- (8 more...)
ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description
Guo, Xiao-Yu, Li, Yi-Fan, Liu, Yuan, Pan, Xiaoyong, Shen, Hong-Bin
Protein design has become a critical method for applications with significant potential, such as drug development and enzyme engineering. However, protein design methods that use large language models with only pretraining and fine-tuning struggle to capture the relationships in multi-modal protein data. To address this, we propose ProtDAT, a de novo fine-grained framework capable of designing proteins from any descriptive protein text input. ProtDAT builds on the inherent characteristics of protein data, unifying sequences and text as a cohesive whole rather than treating them as separate entities. It leverages an innovative multi-modal cross-attention that integrates protein sequences and textual information at a foundational level for seamless fusion. Experimental results demonstrate that ProtDAT achieves state-of-the-art performance in protein sequence generation, excelling in rationality, functionality, structural similarity, and validity. On 20,000 text-sequence pairs from Swiss-Prot, it improves pLDDT by 6%, improves TM-score by 0.26, and reduces RMSD by 1.2 Å, highlighting its potential to advance protein design.
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > California (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Education > Health & Safety > School Nutrition (0.30)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
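As a rough, generic sketch of the cross-attention the ProtDAT abstract mentions: plain scaled dot-product attention in NumPy, where sequence-token embeddings query text-token embeddings. The shapes and dimensions are invented for illustration; this is not ProtDAT's actual module:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query row attends over
    all key/value rows and returns a weighted mix of the values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
seq_tokens = rng.normal(size=(5, 8))    # 5 residue embeddings (queries)
text_tokens = rng.normal(size=(7, 8))   # 7 description-token embeddings (keys/values)
fused = cross_attention(seq_tokens, text_tokens, text_tokens)
assert fused.shape == (5, 8)  # one text-informed vector per residue
```

The design point is that each sequence position receives a text-conditioned representation, which is what lets textual function descriptions steer generation.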
Domain-Independent Automatic Generation of Descriptive Texts for Time-Series Data
Dohi, Kota, Ito, Aoi, Purohit, Harsh, Nishida, Tomoya, Endo, Takashi, Kawaguchi, Yohei
Due to the scarcity of time-series data annotated with descriptive texts, training a model to generate descriptive texts for time-series data is challenging. In this study, we propose a method to systematically generate domain-independent descriptive texts from time-series data. We identify two distinct approaches for creating pairs of time-series data and descriptive texts: the forward approach and the backward approach. By implementing the novel backward approach, we create the Temporal Automated Captions for Observations (TACO) dataset. Experimental results demonstrate that a contrastive-learning-based model trained on the TACO dataset can generate descriptive texts for time-series data in novel domains.
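The backward approach, choosing the descriptive text first and then synthesizing a series that satisfies it, can be sketched with toy generators. The templates below are illustrative stand-ins, not the generators used to build TACO:

```python
import random

# Each caption maps to a generator that produces a series matching it.
TEMPLATES = {
    "upward trend": lambda rng, n: [i * 0.5 + rng.gauss(0, 0.1) for i in range(n)],
    "sudden spike": lambda rng, n: [5.0 if i == n // 2 else rng.gauss(0, 0.1)
                                    for i in range(n)],
}

def backward_pair(caption, n=20, seed=0):
    """Backward approach: start from the descriptive text, then synthesize
    a time series that fulfils it, yielding a guaranteed-correct pair."""
    rng = random.Random(seed)
    return TEMPLATES[caption](rng, n), caption

series, caption = backward_pair("upward trend")
assert series[-1] > series[0]  # the series matches its own description
```

This inverts the forward approach (describe an existing series), trading naturalness of the data for labels that are correct by construction.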
How to Use Large Language Models for Text Coding: The Case of Fatherhood Roles in Public Policy Documents
Lupo, Lorenzo, Magnusson, Oscar, Hovy, Dirk, Naurin, Elin, Wängnerud, Lena
Recent advances in large language models (LLMs) such as GPT-3 and GPT-4 have opened up new opportunities for text analysis in political science. They promise automation with better results and less programming. In this study, we evaluate LLMs on three original coding tasks on non-English political science texts, and we provide a detailed description of a general workflow for using LLMs for text coding in political science research. Our use case offers a practical guide for researchers looking to incorporate LLMs into their text-analysis research. We find that, when provided with detailed label definitions and coding examples, an LLM can be as good as or better than a human annotator while being much faster (up to hundreds of times), considerably cheaper (costing up to 60% less than human coding), and much easier to scale to large amounts of text. Overall, LLMs present a viable option for most text coding projects.
- Europe > Sweden > Västra Götaland > Gothenburg (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States (0.04)
- (7 more...)
- Law (1.00)
- Government (1.00)
- Education (0.93)
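The key ingredients the fatherhood-coding abstract highlights (detailed label definitions plus a few coded examples) can be sketched as a prompt builder. The labels and example texts below are hypothetical stand-ins, and the actual LLM call is deliberately omitted:

```python
# Hypothetical codebook: detailed definitions, as the workflow recommends.
LABEL_DEFINITIONS = {
    "caregiver":    "Text frames the father as a hands-on caregiver.",
    "breadwinner":  "Text frames the father mainly as an economic provider.",
    "not_relevant": "Text does not discuss fatherhood roles.",
}

# A few already-coded examples to anchor the model (invented here).
FEW_SHOT = [
    ("Fathers should share parental leave equally.", "caregiver"),
    ("A father's duty is to support the family income.", "breadwinner"),
]

def build_coding_prompt(text):
    """Assemble a coding prompt: label definitions, coded examples,
    then the item to be coded."""
    lines = ["You are coding policy documents for fatherhood roles.", "Labels:"]
    lines += [f"- {name}: {definition}" for name, definition in LABEL_DEFINITIONS.items()]
    lines.append("Examples:")
    lines += [f'Text: "{t}" -> {label}' for t, label in FEW_SHOT]
    lines.append(f'Now code this text: "{text}"')
    return "\n".join(lines)

prompt = build_coding_prompt("Paternity leave lets fathers care for newborns.")
assert "caregiver" in prompt and "Examples:" in prompt
```

The string returned would be sent to whatever LLM API the project uses; the study's finding is that the definitions and examples, not the model call, are what make coding quality comparable to a human annotator.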
How to start your adventure with AI art?
Before I answer this question, two words of introduction. What you see in the picture below comes from a model that turns text into images: AttnGAN (Attentional Generative Adversarial Network). It converts descriptive texts into synthesized images. Thanks to its attentional generative network, AttnGAN can synthesize fine details in different subregions of an image by paying attention to the relevant words in the natural-language description. The text answering the question of "what an AI is" is not a descriptive text, but let us see what our GAN will generate.
BabyAI++: Towards Grounded-Language Learning beyond Memorization
Cao, Tianshi, Wang, Jingkang, Zhang, Yining, Manivasagam, Sivabalan
Despite success in many real-world tasks (e.g., robotics), reinforcement learning (RL) agents still learn tabula rasa when facing new and dynamic scenarios. By contrast, humans can offload this burden through textual descriptions. Although recent works have shown the benefits of instructive texts in goal-conditioned RL, few have studied whether descriptive texts help agents generalize across dynamic environments. To promote research in this direction, we introduce a new platform, BabyAI++, that generates varied dynamic environments along with corresponding descriptive texts. Moreover, we benchmark several baselines inherited from the instruction-following setting and develop a novel approach to visually grounded language learning on our platform. Extensive experiments provide strong evidence that using descriptive texts improves the generalization of RL agents across environments with varied dynamics.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
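A language-conditioned policy of the kind the BabyAI++ abstract targets can be caricatured in a few lines. The tile world and the word matching below are invented for illustration; the platform's environments and models are far richer:

```python
def act(state, description):
    """Toy language-conditioned policy: the descriptive text says which
    tile is harmful in this episode, so one policy can adapt to varied
    dynamics instead of memorizing a single environment."""
    harmful = next(word for word in description.split() if word in state["tiles"])
    # Move toward any tile that the description does not mark as harmful.
    options = [tile for tile in state["tiles"] if tile != harmful]
    return options[0]

state = {"tiles": ["red", "blue", "goal"]}
assert act(state, "the red tile is lava") == "blue"
assert act(state, "the blue tile is lava") == "red"
```

The same policy changes behaviour purely because the text changed, which is the generalization effect the platform is built to measure: instructive text says what to do, descriptive text says how the world works.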
Data Storytelling: Separating Fiction From Facts
Data is playing a larger role in day-to-day business conversations than ever before. The ability to communicate with data is now a necessity for business leaders, frontline employees, and everybody in between. People who may have easily avoided discussing data in the past are finding numbers being thrust upon them. When data is a foreign language to you, it can be frustrating to not understand what's being said or be able to use it effectively in communications with others. Not being conversant or fluent in data is quickly becoming a liability in today's fast-moving data economy.
Using Artificial Intelligence to Generate Alt Text on Images CSS-Tricks
Web developers and content editors alike often forget or ignore one of the most important parts of making a website accessible and SEO performant: image alt text. If you regularly publish content on the web, then you know it can be tedious trying to come up with descriptive text. Sure, 5-10 images is doable. But what if we are talking about hundreds or thousands of images? Do you have the resources for that?